Improve ROCm SDPA fallback behavior and MI350 support #1
Open

keithloweryamd wants to merge 2 commits into andyluo7:master from
Conversation
Summary
This PR carries a small ROCm-focused cleanup that keeps the training path stable on AMD Instinct while preserving the existing CUDA path.
- `AUTORESEARCH_CACHE_DIR`
- Try `EFFICIENT_ATTENTION` first, then fall back to `FLASH_ATTENTION` and `MATH`
- Keep `bf16` through the softcap and cross-entropy path instead of forcing an intermediate `float()`
- Set `WINDOW_PATTERN` to `L` on ROCm, since the current ROCm SDPA path is full-causal attention and does not implement the sliding-window variants used on the CUDA FA3 path

The changes are intentionally narrow. They do not include local profiling hooks, workspace-specific wrapper scripts, TunableOp solution files, or other investigation artifacts.
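The `bf16` softcap change above can be sketched as follows. This is a minimal illustration, not the PR's exact code: the function name and the `softcap` value are assumptions, and the point is only that the tanh softcap can be computed directly in the input dtype instead of upcasting with an intermediate `float()`.

```python
import torch

def softcap_logits(logits: torch.Tensor, softcap: float = 15.0) -> torch.Tensor:
    """Tanh softcapping computed in the input dtype (e.g. torch.bfloat16).

    No intermediate .float() upcast: the division, tanh, and rescale all
    stay in the logits' own dtype end to end.
    """
    return torch.tanh(logits / softcap) * softcap
```

Keeping the computation in `bf16` avoids a round trip through `float32` activations on the softcap path, which is the kind of incidental upcast the PR removes.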
Why
On the target MI350X environment, the previous ROCm path was functional but left performance on the table and could select slower SDPA behavior depending on the runtime/backend combination. The fallback logic makes the desired efficient attention backend explicit while still keeping training runnable if that backend is unavailable.
The `WINDOW_PATTERN` adjustment also makes the ROCm behavior explicit rather than silently pretending to run the CUDA-side `SSSL` pattern when the actual backend is full causal attention.

MI350X Performance
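The pattern adjustment can be sketched like this. The function name and the interpretation of the pattern string (`S` = sliding-window layer, `L` = long/full-causal layer, as on the CUDA FA3 path) are assumptions for illustration:

```python
import torch

def effective_window_pattern(requested: str) -> str:
    """Collapse the per-layer window pattern to full causal on ROCm.

    The ROCm SDPA path only implements full-causal attention, so every
    layer effectively runs as "L"; reporting that explicitly avoids
    pretending the sliding-window ("S") variants executed.
    """
    if torch.version.hip is not None:  # True on ROCm builds of PyTorch
        return "L" * len(requested)
    return requested
```

On a CUDA or CPU build `torch.version.hip` is `None`, so the requested pattern passes through unchanged.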
Using the saved 300-second end-to-end baselines from the investigation:
- Baseline: 722-731 ms/step, ~717k-726k tok/s, 222.3M tokens in 300s
- This PR: 597-606 ms/step, ~865k-878k tok/s, 267.9M tokens in 300s

That corresponds to:

- 1.21x end-to-end speedup
- +20.5% more tokens processed in the same 5-minute budget
- Memory down from 105632.2 MB to 97440.2 MB

There was also a transient regression during the ROCm attention work when forcing a less favorable SDPA path; the final runtime fallback in this PR is the change that recovered and stabilized the faster path.
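As a sanity check, the quoted speedup and token-count deltas are mutually consistent (numbers taken from the figures above):

```python
# Token totals from the two 300 s runs quoted in the PR description.
baseline_mtok = 222.3   # M tokens, baseline
this_pr_mtok = 267.9    # M tokens, with this PR

speedup = this_pr_mtok / baseline_mtok
extra_pct = (speedup - 1.0) * 100.0
print(f"{speedup:.2f}x end-to-end, +{extra_pct:.1f}% tokens")
# -> 1.21x end-to-end, +20.5% tokens
```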
Validation
- `python -m py_compile prepare.py train.py`
- `train.py`
- Performance ledger in `dev/performance_checkpoints.md`